Abstract. The Simple Knowledge Organization System (SKOS) is a standard model
for controlled vocabularies on the Web. However, SKOS vocabularies often differ
in terms of quality, which reduces their applicability across system
boundaries. Here we investigate how we can support taxonomists in improving
SKOS vocabularies by pointing out quality issues that go beyond the integrity
constraints defined in the SKOS specification. We identified potential
quantifiable quality issues and formalized them into computable quality
checking functions that can find affected resources in a given SKOS vocabulary.
We implemented these functions in the qSKOS quality assessment tool, analyzed
15 existing vocabularies, and found possible quality issues in all of them.

1 Introduction

The Simple Knowledge Organization System (SKOS) [13] is a standard model for
sharing and linking controlled vocabularies (thesauri, classification systems,
etc.) on the Web. Organizations such as the European Union¹, the United
Nations², or the UK government³ publish SKOS representations of their
vocabularies on the Web so that they can easily be accessed by humans and
machines.

However, quality issues can affect the applicability of SKOS vocabularies for
tasks such as query expansion, faceted browsing, or auto-completion, as in the
following examples:



- AGROVOC defines concepts in 25 different languages. While most concepts
have English labels attached, only 38% have German labels. This can be a
problem for multilingual applications that rely on label translations.

- An earlier version of the STW thesaurus (v8.06) contained 5 pairs of
concepts with identical labels. As a result, the auto-complete function of the
online search interface suggested identical entries without disambiguation
information.

- The non-public thesaurus of the Austrian Armed Forces (LVAk) contains 11
disconnected concept clusters. When confronted with these structures, the
maintainers recognized them as data without practical significance.

¹ EuroVoc, http://eurovoc.europa.eu/
² AGROVOC, http://aims.fao.org/standards/agrovoc/about
³ Integrated Public Sector Vocabulary (IPSV), http://doc.esd.org.uk/IPSV

The SKOS specification defines a set of integrity conditions that state whether 
given data patterns are consistent with the SKOS model. Yet the SKOS integrity 
conditions fail to capture quality aspects like the ones above. The main reason 
lies in SKOS' "minimal commitment" approach. A standard that aims at cross- 
domain interoperability should refrain from defining constraints that impose on 
one domain the requirements of another. SKOS is thus very liberal with respect
to data quality. On the other hand, each vocabulary should fulfill domain- and
application-specific quality aspects, and taxonomists often follow standard
guidelines specific to given types of vocabularies (cf. [1, 15]) or apply their
own hand-crafted checks [5]. Existing guidelines consider these aspects, but currently
rely on human judgment, which is subjective and does not scale for larger vo- 
cabularies. The SKOS context, where vocabularies can be linked together on the 
Web, also brings issues hitherto unforeseen by traditional checking approaches. 

We aim at contributing to the ongoing community efforts to bridge that gap 
between model-level integrity constraints and domain-specific quality aspects. 
Our goal is to help taxonomists in identifying possible quality issues in SKOS 
vocabularies and to give them a set of computable quality checking functions 
that, in combination with the taxonomists' experience and domain expertise, 
can serve as quality indicators for vocabularies. Finding such quality issues also 
gives important feedback on the overall vocabulary design process and should, in 
the end, lead to better vocabularies. By defining computable quality checking
functions, we tackle the problem of quality assessment from an objective
perspective. Subjective perception of quality is not within the scope of this
work. Our contribution can be summarized as follows:

- We identified 15 quality issues for SKOS vocabularies by examining existing
guidelines and formalized them into computable quality checking functions
that identify possibly affected resources in a vocabulary.

- With the qSKOS quality assessment tool we provide a reference
implementation of these functions.

- We tested these functions by analyzing a representative set of 15 existing
SKOS vocabularies to learn about possible quality issues.

In the following, we first discuss what "quality" means in the context of
SKOS vocabularies and how it is currently supported by the SKOS specification
and existing tools. Then we introduce the quality issues we have identified and
describe how we implemented them in the qSKOS quality assessment tool. Finally,
we report on the results of an analysis we performed on 15 existing SKOS
vocabularies and show that the quality issues we discussed are real and that
detecting them can lead to the improvement of existing vocabularies. All
supplemental materials⁴ are available online.

⁴ qSKOS: https://github.com/cmader/qSKOS/, wiki:
https://github.com/cmader/qSKOS/wiki, dataset:
https://github.com/cmader/qSKOS-data/



2 Background and Related Work 



The problem of "vocabulary quality" is closely related to the more general one of 
"data quality" and has been discussed in data and information systems research 
(cf. [4]). Pipino et al. [16] argue that dealing with data quality should involve 
both "subjective perceptions of the individuals" and "objective measurements 
based on the data". We see our work as a contribution to the latter.

The SKOS specification does not mention the notion of quality, but defines a
total of six integrity conditions [13], each of which is a statement that
defines under which circumstances data are consistent with the SKOS data model.
For example, "a resource has no more than one value of skos:prefLabel per
language tag". Tools that can check whether these conditions are met are
already available: two of the six conditions are defined formally in the OWL
representation of SKOS and can therefore be validated by any OWL reasoner. For
validating a SKOS vocabulary against the other integrity conditions, one can
use tools such as the PoolParty Thesaurus Consistency Checker⁵ or the Skosify⁶
validator, which can also correct some detected quality problems.

Typical applications of controlled vocabularies are classification, indexing, 
auto-completion, query reformulation, or serving as a glossary. As we discussed in 
detail in earlier work [14], these areas impose specific requirements on vocabulary
features, such as structure, availability, and documentation. Quality aspects of
controlled vocabularies have already been discussed in standardized guidelines [1,
15], manuals [2,6,8,21], tutorials [18], and scholarly articles [5,12]. These most
often rely on manual, precise analysis of individual statements in the data, as
in [19]. Our work builds on this literature but focuses on the less intellectually
demanding checks, which can be automated to assist vocabulary users or publishers.

Data quality is also being discussed in Semantic Web and Linked Data research.
Hogan et al. [9] identify four categories of common errors and shortcomings in
RDF documents, and Heath and Bizer [7] summarize best practices for publishing
data on the Web. Ontology evaluation, i.e., measuring the quality of an
ontology, has also been discussed extensively [23]. However, these authors focus
on RDF datasets and ontologies in general. While we could use some of the
criteria suggested there, such as consistent tagging of literals, these need to
be complemented by considering SKOS-specific properties.

One issue when assessing the quality of SKOS data is the so-called "Open World
Assumption", which underlies the Web of Data itself. Established quality notions
from closed-world systems, such as referential integrity or schema validation,
no longer hold, because available information may be incomplete and facts that
are not explicitly stated cannot be determined to be true or false.
Workarounds, often ad hoc, are thus currently used to evaluate quality in
Linked Data sets [7], as is done in the (rule-based) SKOS tools mentioned above
or in Pellet ICV⁷, which re-interprets OWL axioms with integrity constraint
semantics.

⁵ http://demo.semantic-web.at:8080/SkosServices/check
⁶ http://code.google.com/p/skosify/
⁷ http://clarkparsia.com/pellet/icv/



3 Quality Issues in SKOS Vocabularies 



We identified an initial set of possible quality issues in SKOS vocabularies by 
reviewing literature, manually examining existing vocabularies, and focusing on 
issues that can be measured automatically. Some measures, such as hierarchy 
depth or node centrality, have been omitted due to lack of evidence on their 
general influence on vocabulary quality. We published our findings in the qSKOS 
wiki and requested feedback from experts via public mailing lists and informal 
face-to-face discussions. Based on the responses we received, we translated a subset
of these issues into computable quality checking functions. Each function takes 
a given SKOS vocabulary and an optional vocabulary namespace as input and 
finds all resources that match the corresponding quality issue. For the purpose 
of this work, we define a SKOS vocabulary as follows: 

Definition (SKOS Vocabulary). Let a SKOS vocabulary be a tuple of the form
$V = \langle IR, C, AC, SR, LV, CS \rangle$, with
$IR = ICEXT(\text{rdfs:Resource}^{I})$ being the set of resources,
$C \subseteq IR$ with $C = ICEXT(\text{skos:Concept}^{I})$ being the set of
concepts, $AC \subseteq C$ being the set of authoritative concepts, which are
all concepts that are identified by URIs in the vocabulary namespace,
$SR = IEXT(\text{skos:semanticRelation}^{I})$ being the set of semantic
relations associating concepts with one another,
$LV \subseteq ICEXT(\text{rdfs:Literal}^{I})$ being the set of untyped plain
literals, and $CS = ICEXT(\text{skos:ConceptScheme}^{I})$ being the set of
concept schemes. Further, we let $V$ be the fully entailed RDFS interpretation
of the underlying RDF graph. We enrich $V$ by entailment of owl:inverseOf
properties as well as instances of owl:TransitiveProperty and
owl:SymmetricProperty defined by the formal OWL semantics of SKOS [13].

In the following, we explain the origins and design rationale for each quality 
issue and explain how the corresponding quality checking function works. For 
better readability and due to lack of space we provide only semi-formal defini- 
tions and refer to the source code of the qSKOS tool for further details. 

3.1 Labeling and Documentation Issues 

Omitted or Invalid Language Tags. SKOS defines a set of properties that link
resources with RDF literals, which are plain-text natural language strings with
an optional language tag. This includes the labeling properties rdfs:label,
skos:prefLabel, skos:altLabel, and skos:hiddenLabel, and also the SKOS
documentation properties, such as skos:note and its subproperties. Literals
should be tagged consistently [23], because omitting language tags or using
non-standardized, private language tags in a SKOS vocabulary could
unintentionally limit the result set of language-dependent queries. A SKOS
vocabulary can be checked for omitted and invalid language tags by iterating
over all resources in IR and finding those that have labeling or documentation
property relations to plain literals in LV with missing or invalid language
tags, i.e., tags that are not defined in RFC 3066⁸.

⁸ Tags for the Identification of Languages, http://www.ietf.org/rfc/rfc3066.txt
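
The language tag check can be sketched in a few lines of Python. This is an illustrative sketch only, not the qSKOS implementation: the triple representation, the `Literal` type, and the function name are assumptions made for this example. The regular expression tests only the syntactic shape of an RFC 3066 tag (1 to 8 letters, followed by optional subtags of 1 to 8 letters or digits); it does not check whether the subtags are registered.

```python
import re
from typing import NamedTuple, Optional

SKOS = "http://www.w3.org/2004/02/skos/core#"
RDFS = "http://www.w3.org/2000/01/rdf-schema#"

# Labeling and documentation properties; the subproperties of skos:note
# are listed explicitly because this sketch performs no entailment.
TEXT_PROPERTIES = {
    RDFS + "label",
    SKOS + "prefLabel", SKOS + "altLabel", SKOS + "hiddenLabel",
    SKOS + "note", SKOS + "definition", SKOS + "scopeNote", SKOS + "example",
    SKOS + "historyNote", SKOS + "editorialNote", SKOS + "changeNote",
}

# Syntactic shape of an RFC 3066 tag (assumption: syntax check only).
RFC3066_SYNTAX = re.compile(r"[A-Za-z]{1,8}(-[A-Za-z0-9]{1,8})*")


class Literal(NamedTuple):
    """Hypothetical plain literal: text plus optional language tag."""
    text: str
    lang: Optional[str] = None


def omitted_or_invalid_language_tags(triples):
    """Return subjects with a text property whose literal lacks a well-formed tag."""
    affected = set()
    for subject, predicate, obj in triples:
        if predicate in TEXT_PROPERTIES and isinstance(obj, Literal):
            if obj.lang is None or not RFC3066_SYNTAX.fullmatch(obj.lang):
                affected.add(subject)
    return affected


example_triples = [
    ("ex:a", SKOS + "prefLabel", Literal("cat", "en")),
    ("ex:b", SKOS + "prefLabel", Literal("dog")),            # tag omitted
    ("ex:c", SKOS + "note", Literal("bird", "not a tag")),   # malformed tag
]
affected_resources = omitted_or_invalid_language_tags(example_triples)
```

Note that a private tag such as x-other passes this syntactic test; flagging private tags would need an additional rule.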



Incomplete Language Coverage. The set of language tags used by the literal
values linked with a concept should be the same for all concepts. If this is
not the case, the creators should take appropriate action, such as splitting
concepts or introducing scope notes. This is particularly important for
applications that rely on internationalization and translation. Affected
concepts can be identified by first extracting the global set of language tags
used in a vocabulary from all literal values in LV that are attached to a
concept in C. In a second iteration over all concepts, those having a set of
language tags that is not equal to the global language tag set are returned.
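
The two passes described above can be sketched as follows. The input format (a dictionary mapping each concept to its `(text, language tag)` pairs) and the function name are assumptions chosen for illustration; the sketch also reports which tags each affected concept is missing, which goes slightly beyond the description in the text.

```python
def incomplete_language_coverage(concept_literals):
    """Map each affected concept to the language tags it lacks.

    concept_literals maps a concept URI to an iterable of (text, lang) pairs.
    """
    # Collect the set of language tags used per concept.
    tags_per_concept = {
        concept: {lang for _text, lang in literals if lang}
        for concept, literals in concept_literals.items()
    }
    # First pass: the global set of language tags used in the vocabulary.
    all_tags = set().union(*tags_per_concept.values())
    # Second pass: concepts whose tag set differs from the global set.
    return {
        concept: all_tags - tags
        for concept, tags in tags_per_concept.items()
        if tags != all_tags
    }


example_labels = {
    "ex:cat": [("cat", "en"), ("Katze", "de")],
    "ex:dog": [("dog", "en")],
}
missing_tags = incomplete_language_coverage(example_labels)
```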

Undocumented Concepts. Svenonius [20] advocates the "inclusion of as much
definition material as possible", and the SKOS Reference [13] defines a set of
"documentation properties" intended to hold this kind of information. To
identify all undocumented concepts, we iterate over all concepts in C and
collect those that do not use any of these documentation properties.

Label Conflicts. The SKOS Primer [11] recommends that "no two concepts have
the same preferred lexical label in a given language when they belong to the
same concept scheme". This issue could affect application scenarios such as
auto-completion, which proposes labels based on user input. Although such
cases are acceptable for some thesauri, we generalize the above recommendation
and search for all concept pairs whose respective skos:prefLabel,
skos:altLabel, or skos:hiddenLabel property values meet a certain similarity
threshold defined by a function $sim : LV \times LV \rightarrow [0, 1]$. The
default, built-in similarity function checks for case-insensitive string
equality with a threshold equal to 1. Label conflicts can be found by iterating
over all (authoritative) concept pairs $AC \times AC$, applying $sim$ to every
possible label combination, and collecting those pairs with at least one label
combination meeting or exceeding a specified similarity threshold. We handle
this issue under the Closed World Assumption, because data on concept scheme
membership may be missing and concepts may be linked to concepts with similar
labels in other vocabularies.
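
A minimal sketch of the pairwise comparison is given below. The function names and the dictionary input are assumptions for illustration, and language tags are ignored for brevity. The default similarity function follows the description in the text (case-insensitive equality, threshold 1). The cost is quadratic in the number of concepts, which a real tool would probably reduce by indexing labels.

```python
from itertools import combinations


def case_insensitive_equality(a, b):
    """Default similarity: 1.0 for case-insensitive equality, else 0.0."""
    return 1.0 if a.casefold() == b.casefold() else 0.0


def label_conflicts(labels_by_concept, sim=case_insensitive_equality, threshold=1.0):
    """Return concept pairs with at least one label combination at or above the threshold.

    labels_by_concept maps a concept URI to the list of its pref/alt/hidden labels.
    """
    conflicts = []
    for (c1, labels1), (c2, labels2) in combinations(sorted(labels_by_concept.items()), 2):
        if any(sim(a, b) >= threshold for a in labels1 for b in labels2):
            conflicts.append((c1, c2))
    return conflicts


example_labels = {
    "ex:1": ["Primary peroxisomal enzyme deficiency"],
    "ex:2": ["primary peroxisomal enzyme deficiency", "PPED"],
    "ex:3": ["signal"],
}
conflicting_pairs = label_conflicts(example_labels)
```

Passing a different `sim` function (for example, one based on edit distance) together with a lower threshold would relax the check from equality to similarity.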

3.2 Structural Issues 

Orphan Concepts are motivated by the notion of "orphan terms" in the
literature [8], i.e., terms without any associative or hierarchical
relationships. Checking for such terms is common in thesaurus development and
is also suggested by [15]. Since SKOS is concept-centric, we understand an
orphan concept as a concept that has no semantic relation $sr \in SR$ with any
other concept. Although it might have attached lexical labels, it lacks
valuable context information, which can be essential for retrieval tasks such
as search query expansion. Orphan concepts in a SKOS vocabulary can be found by
iterating over all elements in C and selecting those without any semantic
relation to another concept in C.
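
As a sketch, the orphan check reduces to a set difference. The input format (concept URIs plus `(subject, object)` pairs for the semantic relations) and the function name are assumptions for this example. Reflexive relations are ignored here, since they do not connect a concept to another concept.

```python
def orphan_concepts(concepts, semantic_relations):
    """Return concepts that take part in no semantic relation with another concept.

    semantic_relations is an iterable of (subject, object) concept pairs.
    """
    connected = set()
    for source, target in semantic_relations:
        if source != target:  # a relation to itself gives no context
            connected.update((source, target))
    return set(concepts) - connected


example_concepts = {"ex:a", "ex:b", "ex:c", "ex:d"}
example_relations = [("ex:a", "ex:b"), ("ex:d", "ex:d")]
orphans = orphan_concepts(example_concepts, example_relations)
```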

Weakly Connected Components. A vocabulary can be split into separate
"clusters" because of incomplete data acquisition, deprecated terms, accidental
deletion of relations, etc. This can affect operations that rely on navigating
a connected vocabulary structure, such as query expansion or suggestion of
related terms. Weakly connected components are identified by first creating an
undirected graph that includes all non-orphan concepts (as defined above) as
nodes and all semantic relations SR as edges. "Tarjan's algorithm" [10] can
then be applied to find all connected components, i.e., all sets of concepts
that are connected together by (chains of) semantic relations.
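
The component search can be illustrated with a plain breadth-first search over the undirected graph. Note that this sketch deliberately uses breadth-first search instead of the algorithm cited in the text; for an undirected graph both yield the same components. The function name and the pair-based input are assumptions. Orphan concepts never appear as nodes because only concepts that occur in a relation are added.

```python
from collections import defaultdict, deque


def weakly_connected_components(semantic_relations):
    """Return the connected components of the undirected relation graph.

    semantic_relations is an iterable of (subject, object) concept pairs.
    """
    neighbours = defaultdict(set)
    for source, target in semantic_relations:
        neighbours[source].add(target)  # edge direction is ignored
        neighbours[target].add(source)

    seen, components = set(), []
    for start in neighbours:
        if start in seen:
            continue
        component, queue = set(), deque([start])
        seen.add(start)
        while queue:
            node = queue.popleft()
            component.add(node)
            for nxt in neighbours[node] - seen:
                seen.add(nxt)
                queue.append(nxt)
        components.append(component)
    return components


example_relations = [("ex:a", "ex:b"), ("ex:c", "ex:b"), ("ex:x", "ex:y")]
components = weakly_connected_components(example_relations)
```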

Cyclic Hierarchical Relations is motivated by Soergel et al. [18], who suggest
a "check for hierarchy cycles" since they "throw the program for a loop in the
generation of a complete hierarchical structure". Hedden [8], Harpring [6], and
Aitchison et al. [2] also argue that there exist common forms such as
"generic-specific", "instance-of", or "whole-part" where cycles would be
considered a logical contradiction. Cyclic relations can be found by
constructing a graph with the set of nodes being C and the set of edges being
all skos:broader relations, and checking it for cycles.
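
One way to search the skos:broader graph for cycles is a depth-first traversal that reports a cycle whenever it meets a concept that is still on the current path. The sketch below is an assumption about how such a check could look, not the qSKOS code; it reports one cycle per back edge rather than enumerating every elementary cycle, and it uses recursion, so very deep hierarchies would need an iterative variant.

```python
def hierarchy_cycles(broader):
    """Return cycles found in the skos:broader graph.

    broader maps a concept to the set of its direct broader concepts.
    Each cycle is a list that starts and ends with the same concept.
    """
    WHITE, GREY, BLACK = 0, 1, 2  # unvisited, on current path, finished
    colour = {}
    cycles = []

    def visit(node, path):
        colour[node] = GREY
        path.append(node)
        for parent in sorted(broader.get(node, ())):
            state = colour.get(parent, WHITE)
            if state == GREY:  # back edge: parent is on the current path
                cycles.append(path[path.index(parent):] + [parent])
            elif state == WHITE:
                visit(parent, path)
        path.pop()
        colour[node] = BLACK

    for concept in sorted(broader):
        if colour.get(concept, WHITE) == WHITE:
            visit(concept, [])
    return cycles


example_broader = {
    "a": {"b"}, "b": {"c"}, "c": {"a"},  # a -> b -> c -> a
    "d": {"d"},                          # reflexive skos:broader
    "e": {"a"},                          # leads into a cycle, not part of one
}
found_cycles = hierarchy_cycles(example_broader)
```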

Valueless Associative Relations. The ISO/DIS 25964-1 standard [1] suggests that
terms that share a common broader term should not be related associatively if
this relation is only justified by the fact that they are siblings. This is
advocated by Hedden [8] and Aitchison et al. [2], who point out "the risk that
thesaurus compilers may overload the thesaurus with valueless relationships",
which has a negative effect on precision. This issue can be checked by
identifying concept pairs in $C \times C$ that share the same broader or
narrower concept while also being associatively related by the property
skos:related.
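
The sibling test can be sketched by grouping concepts by shared broader and shared narrower concepts and intersecting the resulting pairs with the skos:related pairs. The pair-based inputs and the function name are assumptions made for this example.

```python
from itertools import combinations


def valueless_associative_relations(broader, related):
    """Return unordered concept pairs that are skos:related and share a
    broader or a narrower concept.

    broader: iterable of (child, parent) pairs; related: iterable of (a, b) pairs.
    """
    related_pairs = {frozenset(pair) for pair in related}
    children_of, parents_of = {}, {}
    for child, parent in broader:
        children_of.setdefault(parent, set()).add(child)
        parents_of.setdefault(child, set()).add(parent)

    candidates = set()
    # Groups with a common broader concept, then groups with a common narrower one.
    for group in list(children_of.values()) + list(parents_of.values()):
        candidates.update(frozenset(pair) for pair in combinations(group, 2))
    return candidates & related_pairs


example_broader = {("cat", "animal"), ("dog", "animal"), ("oak", "plant")}
example_related = {("dog", "cat"), ("cat", "oak")}
valueless = valueless_associative_relations(example_broader, example_related)
```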

Solely Transitively Related Concepts. Two concepts that are explicitly related
by skos:broaderTransitive and/or skos:narrowerTransitive can be regarded as a
quality issue because, according to [13], these properties are "not used to
make assertions". Transitive hierarchical relations in SKOS are meant to be
inferred by the vocabulary consumer, which is reflected in the SKOS ontology
by, for instance, skos:broader being a subproperty of skos:broaderTransitive.
This issue can be detected by finding all concept pairs in $C \times C$ that
are directly related by skos:broaderTransitive and/or skos:narrowerTransitive
properties but not by (chains of) skos:broader and skos:narrower
subproperties.
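
A sketch of this check follows. It assumes that the input holds asserted statements only (before any entailment) and that skos:narrower and skos:narrowerTransitive statements have already been rewritten into their inverse `(child, parent)` form; both the input format and the function name are illustrative.

```python
def solely_transitively_related(broader, broader_transitive):
    """Return asserted broaderTransitive pairs not backed by a chain of skos:broader.

    broader and broader_transitive are iterables of (child, parent) pairs.
    """
    parents = {}
    for child, parent in broader:
        parents.setdefault(child, set()).add(parent)

    def ancestors(concept):
        """All concepts reachable from concept via chains of skos:broader."""
        found, stack = set(), [concept]
        while stack:
            for parent in parents.get(stack.pop(), ()):
                if parent not in found:
                    found.add(parent)
                    stack.append(parent)
        return found

    return {(a, b) for a, b in broader_transitive if b not in ancestors(a)}


example_broader = {("kitten", "cat"), ("cat", "animal")}
example_transitive = {("kitten", "animal"), ("oak", "plant")}
solely_transitive = solely_transitively_related(example_broader, example_transitive)
```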

Omitted Top Concepts. The SKOS model provides concept schemes, which are a
facility for grouping related concepts. This helps to provide "efficient
access" [11] and simplifies orientation in the vocabulary. In order to provide
entry points to such a group of concepts, one or more concepts can be marked as
top concepts. Omitted top concepts can be detected by iterating over all
concept schemes in CS and collecting those that do not occur in relations
established by the properties skos:hasTopConcept or skos:topConceptOf.
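
In code, this amounts to collecting every concept scheme that appears as the subject of skos:hasTopConcept or the object of skos:topConceptOf and subtracting that set from CS. The triple format and function name below are assumptions for illustration.

```python
SKOS = "http://www.w3.org/2004/02/skos/core#"


def omitted_top_concepts(concept_schemes, triples):
    """Return concept schemes that have no top concept.

    triples is an iterable of (subject, predicate, object) URI strings.
    """
    with_top_concept = set()
    for subject, predicate, obj in triples:
        if predicate == SKOS + "hasTopConcept":
            with_top_concept.add(subject)  # scheme hasTopConcept concept
        elif predicate == SKOS + "topConceptOf":
            with_top_concept.add(obj)      # concept topConceptOf scheme
    return set(concept_schemes) - with_top_concept


example_schemes = {"ex:scheme1", "ex:scheme2", "ex:scheme3"}
example_triples = [
    ("ex:scheme1", SKOS + "hasTopConcept", "ex:a"),
    ("ex:b", SKOS + "topConceptOf", "ex:scheme2"),
    ("ex:c", SKOS + "inScheme", "ex:scheme3"),
]
schemes_without_top = omitted_top_concepts(example_schemes, example_triples)
```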

Top Concept Having Broader Concepts. Allemang et al. [3] propose to "not
indicate any concepts internal to the tree as top concepts", which means that
top concepts should not have broader concepts. Affected resources are found by
collecting all top concepts that are related to a resource via a skos:broader
statement and not via skos:broadMatch. Mappings are not part of a vocabulary's
"intrinsic" definition, and a top concept in one vocabulary may well have a
broader concept in another vocabulary.

3.3 Linked Data Specific Issues 

Missing In-Links. When vocabularies are published on the Web, SKOS concepts
become linkable resources. Estimating the number of in-links and identifying
the concepts without any in-links can indicate the importance of a concept. We
estimate the number of in-links by iterating over all elements in AC and
querying the Sindice⁹ SPARQL endpoint for triples containing the concept's URI
in the object part. Empty query results are indicators for missing in-links.

Missing Out-Links. SKOS concepts should also be linked with other related
concepts on the Web, "enabling seamless connections between data sets" [7].
Similar to Missing In-Links, this issue identifies the set of all authoritative
concepts that have no out-links. It can be computed by iterating over all
elements in AC and returning those that are not linked with any
non-authoritative resource.
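
The out-link check can be sketched as below. The sketch assumes that a resource is authoritative exactly when its URI starts with the vocabulary namespace, and that literal-valued and typing statements have been filtered out before the call; the function name and input format are likewise assumptions.

```python
def missing_out_links(authoritative_concepts, links, namespace):
    """Return authoritative concepts with no link to a non-authoritative resource.

    links is an iterable of (subject URI, object URI) pairs.
    """
    linked = {
        subject for subject, target in links
        if not target.startswith(namespace)  # target lies outside the vocabulary
    }
    return set(authoritative_concepts) - linked


NS = "http://example.org/voc/"  # hypothetical vocabulary namespace
example_concepts = {NS + "a", NS + "b"}
example_links = [
    (NS + "a", "http://dbpedia.org/resource/A"),  # out-link
    (NS + "b", NS + "a"),                         # internal link only
]
unlinked = missing_out_links(example_concepts, example_links, NS)
```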

Broken Links. As we discussed in detail in our earlier work [17], this issue is
caused by vocabulary resources that return HTTP error responses or no response
when being dereferenced. An erroneous HTTP response in that case can be defined
as a response code other than 200 after possible redirections. Just as in the
"document" Web, these "broken links" hinder navigability in the Linked Data Web
and should therefore be avoided. Broken links are detected by iterating over
all resources in IR, dereferencing their HTTP URIs, following possible
redirects, and including unavailable resources in the result set.

Undefined SKOS Resources. The SKOS model is defined within the namespace
http://www.w3.org/2004/02/skos/core#. However, some vocabularies use resources
from within this namespace that are unresolvable, for two main reasons:
vocabulary creators "invented" new terms within the SKOS namespace instead of
introducing them in a separate namespace, or they use "deprecated" SKOS
elements like skos:subject. Undefined SKOS resources can be identified by
iterating over all resources in IR and returning those that (i) are contained
in the list of deprecated resources¹⁰ or (ii) are identified by a URI in the
SKOS namespace but are not defined in the current version of the SKOS ontology.

⁹ http://sindice.com/ indexes the Web of Data, which is composed of pages with
semantic markup in RDF, RDFa, Microformats, or Microdata. Currently it covers
approximately 230M documents with over 11 billion triples.

¹⁰ See http://www.w3.org/TR/skos-reference/#namespace



4 Analysis of Existing SKOS Vocabularies 



We used the qSKOS quality assessment tool to find possible quality issues in
existing SKOS vocabularies. From each quality checking function we obtained 
detailed reports listing possibly affected resources. 

4.1 Vocabulary Data Set 

Table 1 summarizes some basic statistical properties of our vocabulary
selection: the number of concepts and authoritative concepts, all
skos:prefLabel, skos:altLabel, and skos:hiddenLabel relations involving
concepts (Concept Labels), all asserted semantic relations, as well as the
number of concept schemes. From these properties we can, for instance, already
see that approximately 3,000 DBpedia Categories concepts do not have labels
(e.g., Category:South_Korean_social_scientists), which is an indicator for
missing natural language descriptions in some Wikipedia categories.



Table 1. Analyzed SKOS vocabularies 



Vocabulary                                           Abbreviation  Version/       Concepts  Authoritative    Concept   Semantic  Concept
                                                                   Last Modified                 Concepts     Labels  Relations  Schemes
----------------------------------------------------------------------------------------------------------------------------------------
United Nations Agricultural Thesaurus                AGROVOC       1.3              32,035         32,035    620,629     65,934        1
DBpedia Categories                                   DBpedia       3.7             743,410        743,410    740,352  1,490,316
The EU's Multilingual Thesaurus                      Eurovoc       5.0               6,797          6,797    457,788     18,491      128
Geonames Ontology                                    Geonames      2.2.1               671            671        671                   9
Gemeenschappelijke Thesaurus Audiovisuele Archieven  GTAA          2010/08/25      171,991        171,991    178,776     50,892        9
Integrated Public Sector Vocabulary                  IPSV          2.00              4,732          3,080      7,945     13,843        3
Library of Congress Subject Headings                 LCSH          2012/03/29      443,164        408,009    750,219    598,134        1
Austrian Armed Forces Thesaurus                      LVAk          0.9              13,411         13,411     17,250     16,346
Middle Kingdom Tombs of Ancient Egypt Thesaurus      Meketre       2011/07/07          422            422        569      1,698        2
Medical Subject Headings                             MeSH          [22]             24,626         24,626    150,617     38,858
North American Industry Classification System        NAICS         2012              4,175          2,213                 8,684        1
New York Times People                                NYTP          2010/06/22        4,979          4,979      4,979                   1
University of Southampton Pressinfo                  Pressinfo     2011/02/24        1,125          1,125
Peroxisome Knowledge Base                            PXV           1.6               2,112          1,686      3,628      2,695        1
Thesaurus for Economics                              STW           8.10             25,107          6,789     58,441     91,816        3



4.2 Results and Discussion 



The results of this analysis are summarized in Table 2, which shows the absolute 
number of possibly affected resources for each quality checking function and 
vocabulary. Numbers marked with an asterisk (*) were obtained by extrapolating 
from subsets containing 5% of the respective vocabulary resources. 



Table 2. Results of the quality checking functions

Issue                                  AGROVOC   DBpedia   Eurovoc  Geonames      GTAA      IPSV      LCSH      LVAk   Meketre      MeSH     NAICS      NYTP Pressinfo       PXV       STW
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Omitted or Invalid Language Tags                               219                                 100,316    13,411              23,950                   0     1,224     1,578         2
Incomplete Language Coverage            32,035               6,370                                       0                 420                             0                        25,050
Undocumented Concepts                   32,035   743,410     5,341              96,850     4,551   342,848    13,411       422     1,807     3,259     4,094     1,125     1,918    23,752
Label Conflicts                          2,949                  48        18    12,404              10,862        13         4
Orphan Concepts                                   77,062         7       671   162,000             173,149        21                                   4,979     1,125         2        70
Weakly Connected Components                  4     1,?06         4                 621         1    22,343        11                   4         1                            10       141
Cyclic Hierarchical Relations                      1,132                                                 0                             4
Valueless Associative Relations            282     8,839         6               9,448       253                                     550                                             5,004
Solely Transitively Related Concepts                         2,652                                                          36               2,189
Omitted Top Concepts                                             1         9         9                   1                                                 1                             2
Top Concept Having Broader Concepts                                                                                                                        0                   1
Missing In-Links                        32,035   733,800     6,796        19  171,980*     3,080  408,000*    13,411       422    24,625     2,213        20     1,125     1,686     6,781
Missing Out-Links                       32,035   743,410     6,797       671   171,991             344,054    13,411       273    24,626         1               1,116     1,0?6
Broken Links                               238        0*        0*                             1       780                 425         1     3,169                  11       163
Undefined SKOS Resources                                                                       1                                       1

We found labeling and documentation issues in all vocabularies. MeSH, PXV,
Pressinfo, and LVAk omit language tags with their labeling properties, LCSH
with the skos:note property. STW does not use language tags with 2 instances of
skos:definition. AGROVOC covers 25 languages, but no single concept is labeled
in all languages. In Meketre, all concepts have English labels, but only some
of them have French labels. STW, which is expressed mainly in English and
German, has many concepts with incomplete language coverage because it (i)
links to non-authoritative concepts that are only labeled in German and (ii)
uses the private, but valid language tag x-other with some of its concept
labels. Geonames, which defines a concept scheme of "feature codes", is the
only vocabulary in our dataset that has at least one documentation property
assigned to all of its concepts. All other vocabularies have a significant
number of undocumented concepts. We also detected possible label conflicts in
half of the vocabularies. PXV, for instance, uses the string "primary
peroxisomal enzyme deficiency" with two concepts in the same concept scheme,
once with a skos:prefLabel and once with a skos:altLabel property. In NAICS we
could not detect any labeling issues but found that it expresses statements
with skosxl:prefLabel as predicate and plain literals as object, which
contradicts the SKOS-XL¹¹ specification.

¹¹ The SKOS extension for Labels (SKOS-XL) provides additional support for
identifying, describing and linking lexical entities. [13]



When analyzing the vocabularies for structural issues, we found that certain
results can be seen as indicators for the types of vocabularies. In the
Pressinfo, Geonames, and NYTP vocabularies, all concepts are orphan concepts,
which means that these vocabularies are authority files rather than thesauri or
taxonomies. This also implies that these vocabularies have no weakly connected
components. GTAA is a mixture of name authority file (approx. 162K concepts)
and thesaurus (approx. 10K concepts). The 70 orphan concepts in STW are
deprecated concepts.

Three vocabularies show no weakly connected components (WCCs) because all
concepts are orphan concepts and thus no relations between them are
established. Two vocabularies (IPSV, NAICS) consist of only one "giant
component", which is often considered the ideal vocabulary structure. STW forms
one giant component (containing 24,572 concepts), but also has 140 additional
WCCs, which all contain linked authoritative and non-authoritative concepts.
All other vocabularies split into several clusters of semantically related
concepts, each of which represents a certain subtopic. Eurovoc, for instance,
has 4 WCCs, containing 4, 5, 6, and 6,775 concepts. In the large WCC it uses a
custom ontology to organize numerous micro-thesauri and domains and
cross-connects concepts by skos:related properties. However, this is not the
case for the three small WCCs, indicating a quality flaw. WCCs divide the
Meketre vocabulary into different topics, e.g., museums or concepts reserved
for internal use. GTAA consists of 621 highly unbalanced WCCs. One component
contains 8,413 subjects from a thesaurus with carefully curated semantic
relations. Most of the other components contain fewer than 10 entities from
other categories, e.g., locations, person names, and genres, for which the
"traditional" information management practices involve much less explicit
linking. PXV splits into 10 topic-related WCCs, such as "deficiencies",
"defects", or "signals". Some of the 11 concept clusters contained in the LVAk
thesaurus are obviously forgotten test data.

Hierarchical cycles are not a common issue except in the collaboratively
created DBpedia vocabulary, where many concepts have reflexive skos:broader
relations. The cycles in MeSH and LVAk could, in our opinion, be resolved by
replacing hierarchical with associative relations or synonym definitions.
Valueless associative relations occur in 8 vocabularies, with their total
number being relatively low compared to the total number of all semantic
relations in the respective vocabularies. Solely transitively related concepts
occur in 3 vocabularies, establishing relations using properties that,
according to [13], should not be asserted directly. This indicates a possible
misinterpretation of the SKOS specification and could result in a loss in
recall on hierarchical queries. GTAA and Geonames omit top concepts in all
concept schemes they define. Eurovoc uses 128 concept schemes but has one
without a top concept, which simply contains all concepts defined in the
vocabulary. Such an "umbrella concept scheme" without a top concept is also
present in LCSH and NYTP. Only the PXV vocabulary is affected by top concepts
having broader concepts in its current version. In earlier versions, more of
them could be found; according to the vocabulary creator, they had been
abandoned but were still available in the triple store, probably because of a
bug in the vocabulary management software.



The difference between the number of concepts and the number of authoritative
concepts in Table 1 already indicates which vocabularies are linked with other
SKOS vocabularies. However, except for NYTP and Geonames, no vocabulary has a
high number of estimated in-links from other Web resources. The number of
out-links is also rather low: NYTP, IPSV, and STW are the three exceptions,
which are fully linked to other Web resources. The one concept with missing
out-links in NAICS is the object of a skos:broaderTransitive relation. One
reason for a high number of missing out-links is that the links were not
available in the main thesaurus file, which is at least the case for AGROVOC.
Even though we could not determine the exact number of broken links because of
the large number of links to resolve (over 400K in Eurovoc, over 500K in LCSH),
we found that broken links are a common issue in most vocabularies. Undefined
SKOS resources seem to be a minor issue, because we could only find two of them
in all vocabularies: MeSH introduces skos:annotation and IPSV uses the
deprecated skos:prefSymbol property.

5 Conclusions and Future Work 

We presented possible quality issues in SKOS vocabularies and described how
we implemented them as quality checking functions in our qSKOS quality
assessment tool. We analyzed a representative set of existing SKOS vocabularies
and found issues in all of them. Labeling and documentation issues were
omnipresent, and structural issues, which require further investigation by the
vocabulary maintainers, were also found in most vocabularies. Although SKOS is
designed for Linked Data, many existing vocabularies still reflect their
closed-system origin, which results in a relatively low number of in- and
out-links. Broken links are a major issue and call for synchronization
mechanisms in order to maintain navigability between concepts in different
vocabularies.

We are aware that these issues are purely quantitative quality indicators.
To learn more about the real-world impact of our work, such as the relative
importance of the identified quality issues, we will conduct a qualitative
follow-up study, in which we discuss these results with more taxonomists. We
will also set up a Web-based SKOS quality checking service to further collect
community feedback and enhance the issues list. These enhancements may
encompass a finer-grained evaluation of some issues, for example documentation
quality indicators, which could also include the average number or length of
documentation statements and their standard deviation across concepts.

We already reported initial results from our analysis to some of the
maintainers of the vocabularies we analyzed. At the time of this writing, we
know that our findings led to improvements in at least two SKOS vocabularies.